init main repo structure and demonstrate the AR + DiT demo for omni models #6

hsliuustc0106 merged 20 commits into main
Conversation
- Add comprehensive PRD, architecture design, and test design documents
- Implement core modules: OmniLLM, AsyncOmniLLM, stage configurations
- Add DiT scheduler and cache manager for diffusion models
- Implement CLI integration with --omni flag support
- Add API server and plugin system for vLLM integration
- Create comprehensive test suite with fixtures
- Update dependencies to vLLM 0.10.2 and PyTorch 2.8.0
- Add conda environment setup and package installation
- Implement stage-based processing architecture
- Add multimodal output processing capabilities

This commit establishes the foundation for multi-modality model inference and serving with non-autoregressive structures.
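To make the stage-based architecture concrete, here is a minimal sketch of how a two-stage AR→DiT pipeline could be described in configuration. The `StageSketch` dataclass and its field names are illustrative assumptions for this write-up, not the repository's actual `OmniStageConfig` API; the Qwen3-0.6B model name comes from the tests later in this PR, and the diffusion model name is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StageSketch:
    """Illustrative stage description (hypothetical fields, not the real OmniStageConfig)."""
    name: str
    engine_type: str                 # "ar" for autoregressive, "dit" for diffusion
    model: str
    input_modalities: List[str] = field(default_factory=lambda: ["text"])
    output_modalities: List[str] = field(default_factory=lambda: ["text"])

# The AR stage produces text (or conditioning) that the DiT stage turns into an image.
pipeline = [
    StageSketch(name="ar", engine_type="ar", model="Qwen/Qwen3-0.6B"),
    StageSketch(name="dit", engine_type="dit", model="some/diffusion-model",
                input_modalities=["text"], output_modalities=["image"]),
]
```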
Code Review
This pull request initializes the repository structure for vllm-omni, a multi-modal extension for vLLM. It includes extensive documentation covering product requirements, architecture, and testing design, along with skeleton code for the core components. The overall structure is well-thought-out and aligns with the project's goals.
My review focuses on identifying potential issues in the initial implementation. I've found a critical import error that will break the code, some incorrect logic in the cache manager and scheduler, and several typos in the documentation. I've also noted a dependency on a non-existent PyTorch version which will cause installation failures. Addressing these points will help build a more robust foundation for the project.
```python
def _create_seq_group_from_request(self, request: Dict[str, Any]) -> Any:
    """Create a sequence group from a DiT request."""
    # This would create a proper sequence group
    # For now, we'll return a mock implementation
    from vllm.v1.core.sched.sequence import SequenceGroup

    # Mock sequence group creation
    # In practice, this would properly create a SequenceGroup
    # with the appropriate metadata for DiT processing
    return None
```
The _create_seq_group_from_request method currently returns None. This will cause a TypeError when the return value is used, for example, when it's appended to scheduled_seq_groups and then iterated over. This method should return a valid SequenceGroup object or a placeholder that doesn't break downstream logic.
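One way to avoid the immediate TypeError, sketched below, is to return a lightweight placeholder object instead of None until a real SequenceGroup can be constructed. The `PlaceholderSeqGroup` name and its fields are hypothetical and only illustrate the "valid object or safe placeholder" suggestion above; they are not part of vLLM or this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class PlaceholderSeqGroup:
    """Hypothetical stand-in until a real SequenceGroup is built for DiT requests."""
    request_id: str
    metadata: Dict[str, Any] = field(default_factory=dict)

# Drop-in sketch of the method above (shown outside its class for brevity).
def _create_seq_group_from_request(self, request: Dict[str, Any]) -> PlaceholderSeqGroup:
    """Create a (placeholder) sequence group from a DiT request."""
    # Returning a real object keeps downstream code that appends the result to
    # scheduled_seq_groups and later iterates over it from raising a TypeError.
    return PlaceholderSeqGroup(request_id=request.get("request_id", ""),
                               metadata=dict(request))
```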
```python
prompt_str, engine_request, tokenization_kwargs = self._process_stage_inputs(stage_config, **stage_args)

# Add inputs to Engine
stage_engine.add_request(request_id, prompt_str, tokenization_kwargs)
```
```python
response_outputs = []
for output in outputs:
    if hasattr(output, 'outputs') and output.outputs:
        for out in output.outputs:
            response_outputs.append({
                "text": getattr(out, 'text', ''),
                "finished": getattr(out, 'finish_reason', 'length') != 'length',
                "tokens": getattr(out, 'token_ids', [])
            })
    else:
        response_outputs.append({
            "text": "",
            "finished": True,
            "tokens": []
        })
```
The response generation logic in the /generate endpoint seems to only handle text-based outputs. It extracts text, finish_reason, and token_ids from the RequestOutput. This is inconsistent with the project's goal of supporting multimodal outputs (like images), and the MultimodalOutputProcessor which is designed to produce outputs with image or latent data. The response model and logic should be updated to handle and serialize multimodal outputs correctly.
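A minimal sketch of what a modality-aware serializer could look like is shown below. The attribute names (`text`, `image`, `latent`) are assumptions about what the MultimodalOutputProcessor might attach to an output, not the project's confirmed schema; the point is only that image and latent payloads need an explicit JSON-safe encoding (base64 here) rather than being dropped.

```python
import base64
import io
from typing import Any, Dict

def serialize_output(out: Any) -> Dict[str, Any]:
    """Sketch of a JSON-safe serializer for mixed text/image/latent outputs."""
    entry: Dict[str, Any] = {
        "text": getattr(out, "text", ""),
        "tokens": list(getattr(out, "token_ids", []) or []),
        "finished": getattr(out, "finish_reason", None) is not None,
    }
    image = getattr(out, "image", None)
    if image is not None:  # e.g. a PIL.Image produced by a DiT stage
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        entry["image_base64"] = base64.b64encode(buf.getvalue()).decode("ascii")
    latent = getattr(out, "latent", None)
    if latent is not None:  # e.g. a torch.Tensor; ship shape plus raw bytes
        entry["latent_shape"] = list(latent.shape)
        entry["latent_base64"] = base64.b64encode(
            latent.cpu().numpy().tobytes()).decode("ascii")
    return entry
```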
- Improve API server with better error handling and response formatting
- Enhance CLI with additional options for DiT stages and configuration
- Add comprehensive examples in examples/basic/ including:
  - API client with health checks and text generation
  - Docker setup and usage examples
  - Simple usage patterns for different scenarios
- Add utility scripts for model downloading and Docker setup
- Update documentation with implementation details and testing guidelines
- Fix configuration validation issues in OmniLLM
- Improve stage configuration handling for AR and DiT stages
- Add proper error handling and fallback mechanisms

Tested with Qwen3-0.6B model:
- Server starts successfully on port 8000
- Health and info endpoints working correctly
- Text generation with various parameters functioning
- API client examples working as expected
- CLI help and configuration options working properly
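As a reference for the testing described above, here is a small client sketch against the server's /health and /generate endpoints. The endpoint paths and port 8000 come from this PR; the JSON field names in the request body (prompt, max_tokens, temperature) are assumptions about the request schema, not the confirmed API.

```python
import requests

BASE = "http://localhost:8000"  # the PR's examples run the server on port 8000

# Health check before sending work.
print(requests.get(f"{BASE}/health", timeout=5).json())

# Text generation; payload field names below are assumed, not confirmed.
resp = requests.post(
    f"{BASE}/generate",
    json={"prompt": "Describe vLLM-omni in one sentence.",
          "max_tokens": 64,
          "temperature": 0.7},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```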
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
- Move vllm_omni/core/omni_llm.py to vllm_omni/entrypoints/omni_llm.py
- Update all import statements across the codebase to reflect new location
- Fix relative imports within the moved file
- Maintain functionality while improving code organization
- All imports and functionality tested and working correctly

This change better reflects that OmniLLM and AsyncOmniLLM are the main entry points for vLLM-omni functionality, rather than core implementation details.
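For reference, a minimal sketch of the import after the move, assuming both classes are exported from the relocated module:

```python
# After the move, the entry-point classes live under entrypoints rather than core.
# Old location (removed): vllm_omni/core/omni_llm.py
from vllm_omni.entrypoints.omni_llm import OmniLLM, AsyncOmniLLM
```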
- Add test_serving.sh: Full-featured testing suite with comprehensive validation
- Add quick_test.sh: Fast validation script for quick testing after changes
- Add scripts/README.md: Complete documentation for testing scripts
- Include health checks, text generation, performance testing, and API integration
- Add retry mechanisms and proper error handling
- Support for different models and ports
- Comprehensive logging and colored output
- Ready for CI/CD integration

Usage:
- Quick test: ./scripts/quick_test.sh [port]
- Full test: ./scripts/test_serving.sh [model_path] [port]
```python
    input_modalities=["text"],
    output_modalities=["text"]
)
stage_configs.append(ar_config)
```
```python
    prompt_logprobs=None,
    outputs=[mock_output],
    finished=True
)
```
Bug: Async Class Calls Sync Method
The AsyncOmniLLM class isn't fully asynchronous. It inherits from LLM and its generate_async method calls the synchronous super().generate(), which blocks the event loop. Additionally, within _execute_stage_async, DiffusersPipelineEngine is initialized with parameters that don't align with its constructor signature.
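A common way to keep the event loop responsive while the underlying generate() remains synchronous is to push the blocking call onto a worker thread. The sketch below is only an illustration of that idea under assumed method names; it is not the repository's AsyncOmniLLM implementation, which inherits from vLLM's LLM.

```python
import asyncio
from typing import Any, List

class AsyncOmniLLMSketch:
    """Illustrative wrapper only; the real AsyncOmniLLM inherits from vLLM's LLM."""

    def generate(self, prompts: List[str], **kwargs: Any) -> List[Any]:
        # Stand-in for the inherited, blocking LLM.generate().
        raise NotImplementedError

    async def generate_async(self, prompts: List[str], **kwargs: Any) -> List[Any]:
        # asyncio.to_thread (Python 3.9+) runs the synchronous call in a worker
        # thread, so awaiting it does not block the event loop; positional and
        # keyword arguments are forwarded to the wrapped callable.
        return await asyncio.to_thread(self.generate, prompts, **kwargs)
```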
correct _thinker_to_talker_prefill to handle multiple segments inside one chunk
P0 fixes:
- vllm-project#1: _free_scaffold_weights now shrinks storage to zero (actually releases VRAM). Only runs when SKIP_SCAFFOLD is also set. Called lazily after first prefill, not at load time.
- vllm-project#2: Sliding VAE default OFF (splice algorithm had alignment bug). _sliding_vae_decode now falls back to full decode until proper overlap-add is implemented.
- vllm-project#3: Complete per-request state reset in preprocess: now clears _curr_prefix_feat_cond, _last_audio_patch_gpu, _prev_audio, _prev_audio_len, _decode_step_count, _precomputed_stop_logits.
- vllm-project#4: compute_logits fallback forces stop (not continue) when _prefill_completed=True, preventing runaway generation.
- vllm-project#5: Scaffold VRAM: load_weights no longer frees immediately; _free_scaffold_weights called after first prefill completes, so scaffold is available for prefill then released.

P1 fixes:
- vllm-project#6: Log all active config flags at load time.
- vllm-project#7: Remove dead _STOP_CHECK_INTERVAL code.
- vllm-project#8: Remove broken audio_duration formula from postprocess.
- vllm-project#9/vllm-project#14: Move `from einops import rearrange` to module top level.
- vllm-project#11: Remove torch.no_grad() context from _forward_decode_graphable (incompatible with CUDA Graph capture).
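For the first P0 item, "shrinks storage to zero" typically means pointing each parameter at an empty tensor so the allocator can actually return the VRAM. The helper below is a generic sketch of that technique, not the repository's _free_scaffold_weights; the function name and module argument are hypothetical.

```python
import torch
from torch import nn

def free_module_weights(module: nn.Module) -> None:
    """Generic sketch: release a module's parameter storage to reclaim VRAM."""
    with torch.no_grad():
        for param in module.parameters():
            # Point the parameter at a zero-element tensor so its old storage
            # (and the VRAM behind it) can be freed by the caching allocator.
            param.data = torch.empty(0, dtype=param.dtype, device=param.device)
    # Ask the allocator to hand cached blocks back to the driver.
    torch.cuda.empty_cache()
```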
vllm serve --omni for 1) AR models only and 2) AR + DiT models

Test Plan:
We test the following scenarios:
Test Results:
==========================================
Note
Introduce vLLM-omni multi-stage (AR→DiT) pipeline with CLI vllm --omni, FastAPI server, diffusers-backed diffusion, configs/output processing, examples, tests, and scripts; update dependencies.
- OmniLLM/AsyncOmniLLM, StageManager, OmniRequest, and implementation docs.
- OmniStageConfig, DiTConfig, DiTCacheConfig (+ helpers); remove legacy dit_cache_interface.
- Output processing (engine/output_processor.py).
- Diffusion engine and workers: engine/diffusion_engine.py, worker/gpu_diffusion_model_runner.py, worker/gpu_diffusion_worker.py.
- DiT cache manager (core/dit_cache_manager.py).
- vllm serve --omni (entrypoints/cli/*, pyproject scripts).
- /generate, /health, /info (entrypoints/api_server.py).
- Examples (examples/*).
- Serving test script (scripts/test_serving.sh).
- Tests (tests/*).
- Dependencies: vllm>=0.10.2, torch>=2.7; add PyYAML; expose new CLI scripts in pyproject.toml and expand requirements.txt dev tools.

Written by Cursor Bugbot for commit 05d2367. This will update automatically on new commits.